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Abstract 

We continue the line of research started in |Vig08 proposing broadword 
(a.k.a. SWAR — "SIMD Within A Register") algorithms for finding matching 
closed parentheses and the fc-th far closed parenthesis. Our algorithms work 
in time O(logui) on a word of w bits, and contain no branch and no test 
instruction. On 64-bit (and wider) architectures, these algorithms make it 
possible to avoid costly tabulations, while providing a very significant speedup 
with respect to for-loop implementations. 

1 Introduction 

A succinct data structure (e.g., a succinct tree) provides the same operations of 
its standard counterpart (and sometimes more), but occupies space that is asymp- 
totically near to the information-theoretical lower bound. A classical example is 
the (2n + l)-bit representation of a binary tree with n internal nodes proposed by 
Jacobson [Jac89j. Recent years have witnessed a growing interest in succinct data 
structures, mainly because of the explosive growth of information in various types 
of text indexes (e.g., large XML trees). 

In this paper we discuss practical implementations of two basic building blocks: 
given a string of w bits, where w is the machine word, representing open (1) and 
closed (0) parentheses, we are interested in solving the following two problems: 

• assuming the first bit is a one, finding the matching closed parenthesis; 

• finding the fc-th far closed parenthesis in the string (a parenthesis is far if its 
matching parenthesis is not in the string). 

Trivial solutions require scanning the string in 0(w) time. For the necessities of 
data structures supporting operations on balanced parenthesis, usually represent- 
ing trees (see, e.g., |Jac89| IMROll IJSS071 IGRRR06] ). the two operations can be 
implemented by tables that in principle use o(n) bits for a structure with n paren- 
theses. However, the tables are actually very big, unless n is very large, and they 
do not usually fit the processor cache. 

In this paper we push further the work started in | Vig08| , where we argued 
that on modern 64-bit architecture a much more efficient approach uses broadword 



programming. The term "broadword" has been introduced by Don Knuth in the 
fascicle on bitwise manipulation techniques of the fourth volume of The Art of 
Computer Programming [Knu07j. Broadword programming uses large (say, more 
than 64-bit wide) registers as small parallel computers, processing several pieces of 
information at a time. An alternative, more traditional name for similar techniques 
is SWAR ("SIMD Within A Register"), a term coined by Fisher and Dietz |FD99] . 
One of the first techniques for manipulating several bytes in parallel were actually 
proposed by Lamport |Lam75j . The famous HAKMEM memo |BGS72j contains 
several examples of broadword programming. 

We are also very careful of avoiding tests whenever possible. Branching is 
a very expensive operation that disrupts speculative execution, and should be 
avoided when possible. All broadword algorithms we discuss contain no test and 
no branching. 

While broadword programming and careful consideration of testing and cache 
side-effects are by now quite common in practical implementations of succinct data 
structures (see, e.g., [DR06J), to the best of our knowledge no one has proposed 
broadword algorithms for the problems we study. Sec |Gog09| for other applications 
of the same ideas. 

We concentrate on 64-bit and wider architecture, but we cast all our algo- 
rithms in a 64-bit framework to avoid excessive notation: the modification for 
wider registers are trivial. We have in mind modern processors (in particular, the 
very common Opteron processor) in which multiplications are extremely fast (ac- 
tually, because the clock is slowed down in favour of multicores), so we use them 
occasionally. They can be safely replaced by 0(logu>) basic operation, but in prac- 
tice experiments show that on the Opteron replacing multiplications by shifts and 
additions, even in very small number, is not competitive. 

The C++/Java code implementing all data structures in this paper is available 



under the terms of the GNU Lesser General Public License at http://sux.dsi .umimi. it/ 



2 Notation 

Consider a string s of n bits numbered from 0. We write Si for the bit of index 
i. When can view s as a string of parentheses by stipulating that 1 represent an 
open parenthesis, and a closed parenthesis. We define the closed excess function 

E s{i) = \{sj | j < i A sj = 0}| - \{sj \j<iA Sj = l}\, 

which represent the excess of closed w.r.t. open parentheses at position i (excluded). 
The string s is balanced if the excess function is always negative, except for and 
n, where it is zero. 

We use a \ b to denote integer division of a by b, 3> and <C to denote right and 
left (zero-filled) shifting, 3> + denotes right shifting with sign extension, &, | and © 
to denote bit-by-bit not, and, or, and xor; x denotes the bit-by-bit complement of x. 
We pervasively use precedence to avoid excessive parentheses, and we use the same 
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precedence conventions of the C programming language: arithmetic operators come 
first, ordered in the standard way, followed by shifts, followed by logical operators; 
© sits between | and &. 

We use L k to denote the constant whose ones are in position 0, k, 2k, . . . 
that is, the constant with the lowest bit of each k-bit subword set (e.g, L§ = 
0x01010101010101010101). This constant is very useful both to spread values (e.g., 
0x12 * L§ = 0x1212121212121212) and to sum them up, as it generates cumula- 
tive sums of k-bit subwords if the values contained in each £;-bit subword, when 
added, do not exceed k bits, (e.g., 0x030702 *L 8 = 0x30A0C0C0C0C0C0C0902— 
look carefully at the three rightmost bytes). We use H k to denote L k <C k — 1, 
that is, the constant with the highest bit of each fc-bit subword set (e.g, H$ = 
0x8080808080808080). 

We use the notation 

» k : = (2 2m -l)\(2 2fc + l), 

where \ denotes integer division. More intuitively, /xo = 0x5555 . . . 5555, [i\ = 
0x3333 . . . 3333, ^ = OxOFOF . . . 0F0F, \x 2 = OxOOFF . . . 00FF, and so on. 

Our model is a RAM machine with w-b\t words that performs logic opera- 
tions, additions and subtractions in unit time using 2-complement arithmetic. In 
our algorithms we also use a constant number of multiplications, which can be 
substituted with 0(logu>) shifts and adds without altering the running time. 

3 Basic operations 

We recall the expression for computing in parallel the differences modulo 2 k of 
each fe-bit subword (see |Knu 07j): 

x - k y := ((x | H k ) - (y & H^)) © ((x © y) k H k ). 

If we know in advance that the blocks in x and y contain positive entries, this 
simplifies to 

((x | H k ) - y) © H k . 
Another important operation we will use is blockwise nonzero test: 

x^ k 0:= [{{{x\H k )-L k )\x))kH k . 

Finally, truncated difference of positive entries: 

x - k y = (x - k y) Sz ((x - k y) > k - 1) - k 1 

The subexpression after the & is simply a mask that cancels out every block in 
which a negative result was obtained. The common subexpression x — k y should 
be, of course, computed just once. 
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4 Matching open parentheses 



Assume we have a string s such that So = 1. We would like find the associated 
matching closed parenthesis, if it lies in s, or get some special value otherwise. The 
general strategy to obtain this result in 0(\ogw) time and O(l) additional space 
is to consider the excess function, as clearly we are interested in computing 

min E s (j) = 0. 

0<j<w 

We operate in the following manner: we will sample E s each 2 r io s lo s ^"1 positions. 
Then, we will scan linearly in parallel each of the resulting w/2^ l ° slogw ~\ blocks 
from the end, recording whether in some block the function crosses zero, and 
where this happens. Finally, we find the first block that hit a zero and return the 
corresponding position. 

Let us first consider a 64-bit sampling phase on input x; blocks are just bytes 
this case. We start with a small variant of the standard broadword algorithm for 
sideways additions: 



b = x - (x & OxAAAAAAAAAAAAAAAA) > 1 

1 b=(bSz 0x3333333333333333) + {(b > 2) & 0x3333333333333333) 

2 b = (b + (b > 4)) & OxOFOFOFOFOFOFOFOFO 

3 b = (b * L%) < 1 

At this point, each byte of b contains twice the number of open parentheses ap- 
pearing up to that block, included. Note that the excess function satisfies 

E s(j) = \{sj | j < iAsj = 0}\-\{ Sj | j < lAsj = 1}| =j-2\{ Sj | j < iAsj = 1}|, 

so getting a sample of E s each 8 bits just requires parallel subtraction with a 
suitable constant: 

b=(H s \ 0x4038302820181008) - 8 b. 

Note the presence of Hg, which avoid propagation of the sign bit, and in practice 
let us represent each sample in two's complement in the seven lower bits of each 
byte. We now set up an update mask u that contains, for each byte of b, zero, if 
the byte is nonzero in the lower seven bits, but 0x7F otherwise: 

u = ((((6 | H 8 ) - L 8 ) » 7 & L 8 )|flg) - L 8 

Using u we set up our last variable z, that throughout the computation will 
contain, for each byte of b, either zero, if the byte was never equal to zero (in 
the lower seven bits), or a counter expressing the position of the parenthesis that 
caused the excess function to go to zero. If we find a zero byte initially, the position 
is clearly 7: 

z = (H 8 > 1 | * 7) & u. 
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We now update b, modifying the values of the excess function two bits at a time: 
this is correct, as a balanced string has necessarily even length, so the excess 
function cannot go to zero at an odd position. In the first round we thus compute 

b = b - (L 8 * 2 - ((x > 6 & L 8 < 1) + (x > 5 & L$ < 1))) 

We now recompute u as above, but update z as follows: 

z = z k. u | (H$ > 1 | L 8 * 5) & u. 

Due to the update rule, even nonzero bytes of z will be updated. This is correct, 
as we want to find the zero of the excess function that is closer to the first bit. We 
continue in this way until we have completed scanning each byte: the next update 
of b is thus 

b = b - (L 8 * 2 - {(x > 4 & L 8 < 1) + (x > 3 & L 8 < 1))), 

and so on. Finally, we gather our result by locating the relevant block using an 
LSB operator (e.g., Brodal's [K nu07| ). which we assume to return —1 in case no 
bit is set: 



p = LSBO > 6 & L B ) 

1 ((p + (z > p & 0x3F)) | (p > 8))) & 0x7F 

The last line contains the expression returned (we will return 127 in case no match- 
ing parenthesis exists). 

The algorithm is best followed on an example: consider the first two bytes of 
a 64-bit string: 



11001011 00001010 



Note that the left most bit is bit zero. We are representing the string of parentheses 



The excess function behaves as follows: 




In the first computation step, we sample the excess function at each byte, so the 
first bytes of b (in two's complement) are -2 and +2. No result is thus stored in 
z. However, in the first update we modify the samples of the excess function by 
subtracting the contribution of the underlined parentheses: 



110 10 11 



1 1 ... 
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Now the first byte of b changes to 0, so we store in z our result as follows: 



01000101 00000000 



The sixth bit records that there is a value, and for the time being the candidate 
result is 5. Note that the current result is spurious, because there is another zero 
to be found. 

We now update again b, subtracting another pair of parentheses: 



110 10 11 



10 10 



The first byte of b is now 0, the second byte +2. Thus, z is updated as follows: 



01000011 00000000 



In the last update, the second byte of b becomes 0. The final value of z is thus 



01000011 01000001 



Now the LSB operator detects that the first zero is in the first byte, and the correct 
value (3) is extracted from z and returned. 

The construction of the first sample requires 0(loglogu>) instructions and a 
single multiplication (which could be substituted with 0(logw) operations). The 
parallel linear scan clearly requires 0(\ogw) operations. 



5 Finding far closed parentheses by index 

In this section we discuss the problem of finding the p-th far closed parenthesis. 
The simple combinatorial idea at the heart of the algorithm is the following, easily 
proved statement: 

Proposition 1 Let t, u be bit strings, #open/#closed the operators returning the 
number of far open/closed parenthesis in a string, and — truncated subtraction. 
Then: 

^opcn^'U' — (^opcn^ T^closed^) ^opcn^ 
^closed^"^ — (^closed^ ^open^) ~i~ ^closed** 

Since it is easy to compute the number of far open/closed parenthesis in 2- 
bit blocks using masking, and it is also easy to do parallel truncated subtraction, 
using additional 0(logw) words we can compute the number of far open/closed 
parentheses in blocks of length 2*, 2 < i < logw. At that point, we can use the 
above property backwards: if we are searching from the k-th far closed parenthesis 
in tu, this must be either in t, if k < #ciosed*> or in u, but in position A; — # c iosed* + 
#open*- We will assume for the time being that a p-th far closed parenthesis does 
exist in the string. Results will be unpredictable otherwise. 

In general, each 2 fc -bit block of the variables Ok and cj~ will keep track of the 
number of far open/closed parentheses in the corresponding 2 fc -bit block of the 
input x. We bootstrap our computation by filling o\ and c\\ 
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b = x k 0x5555555555555555 

1 h = {x k OxAAAAAAAAAAAAAAAA) > 1 

2 I = (60 61) k 61 

3 oi = (60 k 61) < 1 1 1 

4 ci = ((60 | 61) 0x5555555555555555) < 1 | Z 
These operations implements the mappings 

00 ^ 00 00 ->• 10 

01^01 01^01 

10 -^00 10 -> 00 

11^10 11^00 

They send each 2-bit substring to the number of far open, or closed, respectively, 
parentheses. 

The k-th phase, 1 < k < logw, records in temporary variables e Q and e c the 
number of far open and far closed parentheses in each half of 2 fe+1 -bit blocks. These 
numbers are then combined using Proposition [Tj 



e Q = o k & [i k 

1 e c = (c fc & n k < 2 fc ) > 2 k 

2 o fe+1 = ((o fc & R ,<2 fc )>2' c ) + (e - 8 e c ) 

3 Cfc+i = (cjfe & /ife) + (e c - 8 e ) 

Finally, we work backwards, isolating the part of the string containing the required 
parenthesis. At the k-th. step, k = log w — 1, log w — 2, ... ,1 we operate as follows: 



6= ((p - (c fc > s & 2 2k -l))>+w-l)-l 

1 m = 6&2 2fc -l 

2 p — = Cfc km 

3 p += o k km 

4 s+=2 fe &6 



The variable s keeps track of the left (i.e., lowest) extreme of the interval of width 
2 fc+1 in which we are performing our binary search. Initially, s is zero and k = 
logw — 1, which means that we are searching for the p-th far closed parenthesis in 
the whole string s. 

In each phase, we first of all set 6 so that it is if the p-th far closed parenthesis 
appears in the block of length 2 k starting at position s, otherwise. Note that we 
can do this because the far closed parentheses in the first half are true, global far 
closed parentheses. We then set up our mask m, which will be used to update p: 
if m is zero, there is no update to do — we just have to restrict our search interval. 
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Otherwise, we have to decrease p by Ck & m (as we are skipping Ck & m far closed 
parentheses) and increase it by & m (as there are o& &z m far open parentheses 
before the block we're moving in, so we must offset p). Finally, s must be updated 
and moved forward by 2 k in case b 7^ 0. 

In the last phase, we are left with a two-bits string and a value p. It is easy to 
check that the following hand-crafted expression gives the correct result: 

s+p+((x>s&((p<l) I 1))<1). 

Finally, it is easy to see that be performing an additional phase in the first part 
of the algorithm we can obtain the overall number of far closed parentheses in 
the whole string, making it easy to return a special value in case the requested 
parenthesis does not exist. 

6 Experiments 

We performed a number of experiments on a Linux-based system sporting a 64-bit 
Opteron processor running at 2814.501 MHz with 1 MiB of first-level cache. The 
tests show that on 64-bit architectures broadword programming provides signifi- 
cant performance improvements. We compiled using gcc 4.1.2 and options -09. 

Our previous experience with similar code shows that testing in isolation very 
tight code can produce paradoxical results. It is much more informative to embed 
the code in a typical simple application: in our case, we implemented Jacob- 
son's classical 0(n) balanced parentheses representation |Jac89j and performed 
tests measuring the time required to find a matching closed parenthesis using our 
broadword algorithms and a tuned for-loop implementation. 

The experimental setting for benchmarking operations that require nanosec- 
onds must be set up carefully. We generate at random bit arrays containing cor- 
rectly parenthesised strings, and store a million test positions. During the tests, 
the positions are read with a linear scan, producing minimal interference; gener- 
ating random positions during the tests causes instead a significant perturbation 
of the results, mainly due to the slowness of the modulo operator. The tests are 
repeated ten times and averaged. We measure user time using the system function 
getrusage () . 

Generating random balanced strings of parenthesis requires some attention. We 
use Arnold and Sleep's classical algorithm [AS80], but with a twist. The algorithm 
chooses at each step whether to add a closed parenthesis with probability 

_ l r(fc + r + 2) 
r ' k ~ 2 k(r + 1) ' 

where r is the number of open parentheses still to be closed, and k the remaining 
number of symbols to be generated. Note that when k = r we have P r ^ = 1, so 
we just generate closed parentheses. 
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.75 


.50 


.25 


IKi 


62.10/89.90 


68.50/ 


'116.80 


76.50/ 


'130.40 


86.50/ 


'142.60 


4Ki 


62.90/95.40 


68.80/ 


'115.00 


77.30/ 


'123.60 


87.70/ 


'152.20 


16 Ki 


63.10/ 


100.30 


68.70/ 


'113.50 


78.00/ 


'127.20 


87.10/ 


'153.20 


64 Ki 


63.70/ 


100.70 


69.40/ 


'113.10 


79.00/ 


'128.50 


88.20/ 


'154.30 


256 Ki 


69.20/ 


105.40 


75.50/ 


'119.50 


86.20/ 


'134.70 


96.00/ 


'161.60 


IMi 


78.70/ 


116.30 


87.50/ 


'130.30 


97.00/ 


'144.90 


109.30/ 


'173.50 


4 Mi 


179.20/ 


213.20 


190.20/ 


'231.60 


211.80/ 


'261.50 


237.40/ 


'301.20 


16 Mi 


246.30/ 


278.20 


281.50/ 


'320.60 


327.50/ 


'376.10 


424.30/ 


'489.30 



Table 1: Timings in nanoseconds for a parenthesis matching operation in Jacob- 
son's data structure. The first value is obtained used the broadword algorithms 
presented in this paper, whereas the second value is obtained using a for-loop im- 
plementation. Column labels show the amount of twisting, whereas row labels 
show the number of parentheses in the string. 



To estimate better the behaviour of our algorithms, we introduce a twist, that 
is, a number < t < 1 that shifts the probability so that open parentheses are 
more likely to be generated. In other words, 

p Jl ifPr,* = l; 

I tP rj k otherwise. 

The result is that when t < 1 we will tend to generate strings with deeper nesting. 
We are interested in experimenting with the behaviour at different deepness levels 
because trivial (for-loop) solutions behave very well on random strings because 
most open parentheses are near, and moreover their matching parenthesis is a few 
bits away. But if you consider a typical application, for instance, binary search 
trees, then a search going down into a large tree has to find a far matching paren- 
thesis for most of the search. More precisely, it is not difficult to see that for a 
complete binary tree the average (over all paths going from the root to a leaf) dis- 
tance between the open and closed parenthesis of a query is 0(n/ log n) (assuming 
the binary tree is mapped to a forest using the inverse of the first-child/next-sibling 
isomorpshism, and that the forest is represented using balanced parentheses in the 
standard way). To simulate this fact, we use a skewed distribution: we plan to 
enlarge, however, our test set with more realistic large search trees or XML trees. 

We compare our structures against tuned for-loop implementations: results are 
shown in Table [T] and Figure [TJ which clearly show the advantage of the broadword 
implementation, in particular for longer matchings (e.g., for low twist). We expect, 
of course, that figures will improve as w gets larger. 
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Figure 1: A graph displaying the data shown in Table [T] Up to around one 
million bit the timings remain constant even in practice; after that, memory access 
becomes significant and size has a significant effect on speed (as in the case of 
rank/select queries — see [GGMN05J). 
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7 Conclusions 



Extending some previous work of ours |Vig08 , we have introduced some two new 
broadword algorithms that implement two basic operations typical of succinct 
static data structures for balanced parentheses. We have also presented experi- 
ments that compares our results with a for-loop baseline. We discussed our algo- 
rithms in the case of closed parentheses, but they can be immediately modified to 
find matching open or far open parentheses. 

We leave for future work experimentation with tabulated implementations. The 
latter tend to be, of course, very fast when tested, but they engage the processor 
cache significantly, and their global impact cannot be measured easily. For-loop 
implementations have a cache footprint similar to that of our broadword versions, 
so they are first natural candidate for comparisons. 

A Java version of this code is currently distributed by the Sux4J project^] as 
part of a highly compressed implementation of a monotone minimal perfect hash 
function (see |BBPV09j ). 
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